
feat: improve maintainers detection [CM-1033] #3908

Open
mbani01 wants to merge 12 commits into main from feat/improve_maintainer_file_detection

Conversation

@mbani01
Contributor

@mbani01 mbani01 commented Mar 10, 2026

What changed

Before

  • File discovery was a sequential scan of a hard-coded flat list (MAINTAINER_FILES: 13 entries, root-only, no recursion).
  • The first matching file was used — no ranking, no scoring, no fallback strategy.
  • README.md was in the candidate list and required a simple content check for the word "maintainer".
  • AI file-selection received a plain list of filenames with no signals to rank them.
  • extract_maintainers always started from scratch — no reuse of a previously found file.
  • compare_and_update_maintainers skipped all maintainers with github_username == "unknown", including those with a valid email; no email fallback for identity lookup.
  • candidate_files and ai_suggested_file did not exist in MaintainerResult or execution metrics.
  • The full-content AI extraction prompt was always built upfront, even when the content was going to be chunked.

After

Detection pipeline (4-step with fallback)

  1. Saved file reuse — if a maintainer file was found on a previous run, it is tried first before any scanning.
  2. Ripgrep recursive search + scoring: rg scans the full repo for files matching 20 governance stems (MAINTAINERS, OWNERS, CODEOWNERS, GOVERNANCE, EMERITUS, etc.) across all depths and valid extensions. Each file is scored: exact known path (100), exact stem match (50), partial stem (25), plus +1 per governance keyword found in content. All candidates are returned sorted by score; the top one is analyzed.
  3. README guard — README files are rejected immediately (no AI call) unless their content contains the word maintainer.
  4. AI file-selection fallback — if the top candidate fails, the full repo file list is scanned, pre-filtered to governance-scored files (capped at 300) with the already-failed file excluded, and passed to AI as (filename, score) tuples. The prompt instructs the model to prefer higher scores, shallower paths, and to reject files inside vendor/, node_modules/, third_party/, external/, and similar third-party directories.
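The scoring in step 2 can be sketched roughly as follows. This is a minimal illustration, not the PR's `_score_filename` implementation: it assumes the three filename tiers are mutually exclusive (the highest applicable tier wins) and that each keyword counts at most once per file. The constant sets are shortened, and `score_candidate` and `rank` are hypothetical names.

```python
from pathlib import PurePosixPath

# Illustrative subsets only; the sets in the PR are larger.
KNOWN_PATHS = {"maintainers.md", "codeowners", ".github/codeowners", "docs/maintainers.md"}
GOVERNANCE_STEMS = {"maintainers", "owners", "codeowners", "governance", "emeritus"}
SCORING_KEYWORDS = ["maintainer", "codeowner", "owner", "governance", "emeritus"]


def score_filename(path: str) -> int:
    """Filename tier: known path (100) > exact stem (50) > partial stem (25)."""
    lowered = path.lower()
    if lowered in KNOWN_PATHS:
        return 100
    stem = PurePosixPath(lowered).stem
    if stem in GOVERNANCE_STEMS:
        return 50
    if any(known in stem for known in GOVERNANCE_STEMS):
        return 25
    return 0


def score_candidate(path: str, content: str) -> int:
    """Filename tier plus 1 for each distinct keyword present in the content."""
    text = content.lower()
    return score_filename(path) + sum(1 for kw in SCORING_KEYWORDS if kw in text)


def rank(files: dict[str, str]) -> list[tuple[str, int]]:
    """Return (path, score) pairs sorted by descending score, then by path."""
    scored = [(path, score_candidate(path, content)) for path, content in files.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Sorting on `(-score, path)` keeps the order deterministic when two files tie, which makes "the top one is analyzed" reproducible across runs.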

Bug fixes

  • compare_and_update_maintainers: the skip guard now only fires when both github_username and email are unknown/None (previously skipped all "unknown" usernames unconditionally). New maintainers identified by email now go through find_maintainer_identity_by_email as a fallback, matching insert_new_maintainers behaviour.
  • Extraction prompt for chunked content is now built lazily inside the else branch, avoiding a wasted string allocation on every large file.

Observability

  • MaintainerResult gains candidate_files: list[tuple[str, int]] and ai_suggested_file: str | None.
  • ServiceExecution metrics now record candidate_files (top-100 by score) and ai_suggested_file on every run.

Note

Medium Risk
Refactors the maintainer extraction pipeline to rely on recursive ripgrep-based discovery, scoring, and AI fallback, which can change which governance file is selected and affects runtime behavior/cost. Also adds a new runtime dependency (ripgrep) and persists more execution metadata, so failures or environment mismatches could impact maintainer processing.

Overview
Improves maintainer extraction by replacing the prior hard-coded maintainer filename scan with a multi-step pipeline: reuse the previously saved maintainer file when available, otherwise perform recursive ripgrep-based candidate discovery with filename/content scoring, then fall back to an AI-driven file picker fed with scored candidates.

Adds guards and fixes around maintainer ingestion: README candidates are rejected unless they mention maintainer, unknown usernames are only skipped when email is also unknown, and identity lookup now falls back to find_maintainer_identity_by_email when github_username is unknown. Extends MaintainerResult and ServiceExecution metrics to record top candidate_files and the ai_suggested_file, and updates the git-integration Docker image to include ripgrep.
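A minimal sketch of the README guard, assuming the check is case-insensitive and applies to any file whose stem is `readme` regardless of extension or directory. `passes_readme_guard` is a hypothetical name.

```python
from pathlib import PurePosixPath


def passes_readme_guard(path: str, content: str) -> bool:
    """Reject README files that never mention 'maintainer'; pass everything else."""
    stem = PurePosixPath(path.lower()).stem
    if stem != "readme":
        return True  # the guard only applies to README files
    return "maintainer" in content.lower()
```

Because this is a plain substring check, it can run before any AI call is made.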

Written by Cursor Bugbot for commit b38abdc. This will update automatically on new commits.

@mbani01 mbani01 self-assigned this Mar 10, 2026
Copilot AI review requested due to automatic review settings March 10, 2026 15:41
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

Copilot AI left a comment

Pull request overview

This PR improves maintainer file detection in the git integration service by adding a multi-step discovery and analysis flow that combines static filename matching, dynamic ripgrep-based content search, and an AI fallback, while also surfacing more metadata about what was tried.

Changes:

  • Added ripgrep-based repo scanning (rg --files and keyword search) with fallback to os.walk, plus scoring/filtering of dynamic candidates.
  • Refactored maintainer extraction to prioritize a previously saved maintainer file, then analyze top candidates, then use AI file suggestion as a last resort.
  • Extended MaintainerResult and service execution metrics to include candidate_files and ai_suggested_file; added ripgrep to the Docker image.
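The ripgrep-with-fallback listing mentioned in the first bullet might be structured along these lines. This is a sketch under assumptions, not the PR's code: `walk_files` and `list_repo_files` are hypothetical names and the 60-second timeout is arbitrary. Note that `rg --files` skips hidden and ignored files by default, so the two code paths can return different sets.

```python
import os
import shutil
import subprocess


def walk_files(repo_root: str) -> list[str]:
    """Pure-Python listing: repo-relative, forward-slash paths, .git skipped."""
    found = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # prune in place
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), repo_root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def list_repo_files(repo_root: str) -> list[str]:
    """Prefer `rg --files`; fall back to os.walk if rg is missing or fails."""
    if shutil.which("rg") is None:
        return walk_files(repo_root)
    try:
        proc = subprocess.run(
            ["rg", "--files"],
            cwd=repo_root,
            capture_output=True,
            text=True,
            check=True,
            timeout=60,
        )
    except (subprocess.SubprocessError, OSError):
        return walk_files(repo_root)
    return sorted(
        line[2:] if line.startswith("./") else line
        for line in proc.stdout.splitlines()
        if line.strip()
    )
```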

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py | New candidate discovery + fallback extraction flow; logs and metrics now include candidate/AI-suggested file metadata. |
| services/apps/git_integration/src/crowdgit/models/maintainer_info.py | Adds new result metadata fields (candidate_files, ai_suggested_file). |
| scripts/services/docker/Dockerfile.git_integration | Installs ripgrep in the runner image to support dynamic search. |


@mbani01 mbani01 requested a review from joanagmaia March 10, 2026 16:42
@mbani01 mbani01 marked this pull request as draft March 10, 2026 17:54
mbani01 added 9 commits March 11, 2026 14:05
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
… detection

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…rd in content

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 force-pushed the feat/improve_maintainer_file_detection branch from bc8e3df to b4dd488 Compare March 11, 2026 14:05
mbani01 added 2 commits March 11, 2026 14:32
…improve prompt

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 marked this pull request as ready for review March 11, 2026 14:34
Comment on lines +52 to +126
KNOWN_PATHS = {
    "maintainers",
    "maintainers.md",
    "maintainer.md",
    "codeowners",
    "codeowners.md",
    "contributors",
    "contributors.md",
    "owners",
    "owners.md",
    "authors",
    "authors.md",
    "governance.md",
    "docs/maintainers.md",
    ".github/maintainers.md",
    ".github/contributors.md",
    ".github/codeowners",
}

# Governance stems (basename without extension, lowercased) for filename search
GOVERNANCE_STEMS = {
    "maintainers",
    "maintainer",
    "codeowners",
    "codeowner",
    "contributors",
    "contributor",
    "owners",
    "owners_aliases",
    "authors",
    "committers",
    "commiters",
    "reviewers",
    "approvers",
    "administrators",
    "stewards",
    "credits",
    "governance",
    "core_team",
    "code_owners",
    "emeritus",
}

VALID_EXTENSIONS = {
    "",
    ".md",
    ".markdown",
    ".txt",
    ".rst",
    ".yaml",
    ".yml",
    ".toml",
    ".adoc",
    ".csv",
    ".rdoc",
}

SCORING_KEYWORDS = [
    "maintainer",
    "codeowner",
    "owner",
    "contributor",
    "governance",
    "steward",
    "emeritus",
    "approver",
    "reviewer",
]

EXCLUDED_FILENAMES = {
    "contributing.md",
    "contributing",
    "code_of_conduct.md",
    "code-of-conduct.md",
}
Contributor Author

Those were mainly inferred from our processing history.

Contributor

@joanagmaia joanagmaia left a comment

This looks great, I have a couple of questions and requests to make sure that we have some more metrics given that these are big changes on the current process.

Questions:

  • With the mechanism of only picking one file for analysis, we are assuming that all maintainer information will be in a single file, right? I'm not sure whether we should check that we won't lose data because of it.

Requests:

  • Can we run the new mechanism in like 10 repos and see the accuracy? I would even say on the current issues we have opened on Insights as well to see if we have improved coverage https://github.com/linuxfoundation/insights/issues?q=is%3Aissue%20state%3Aopen%20maintainer
  • Can we prepare a monitor in metaplane that covers the amount of repositories where we can get maintainers data for? And also the amount of projects?
  • Can we test using the Haiku model for find_maintainer_file_with_ai, since it would be a simpler task than the rest of the work?

MAX_AI_FILE_LIST_SIZE = 300

# Full paths that get the highest score bonus when matched exactly
KNOWN_PATHS = {
Contributor

We should also include SECURITY-INSIGHTS.md. It was supported before as well.
E.g. https://github.com/open-telemetry/opentelemetry-dotnet/blob/d54379e28c07db783452a33e119f1cdf8e7d96a6/SECURITY-INSIGHTS.yml#L13

}

# Governance stems (basename without extension, lowercased) for filename search
GOVERNANCE_STEMS = {
Contributor

Should we also add:

My only concern here is that it seems that they use the community repo to manage some maintainers data. So here we might need to infer the repository based on the directory structure. Maybe it's too complex for us to want to support at least for now

Contributor Author

It's tricky when repo and maintainers are in different places, will check how we can support this easily

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.


  role = maintainer.normalized_title
  original_role = self.make_role(maintainer.title)
- if github_username == "unknown":
+ if github_username == "unknown" and maintainer.email in ("unknown", None):

Dict dedup silently drops "unknown" username maintainers

High Severity

new_maintainers_dict is built with {m.github_username: m for m in maintainers}, so when multiple maintainers have github_username="unknown", only the last one survives. Previously all "unknown" entries were unconditionally skipped, so the dedup was harmless. Now that the skip guard allows "unknown" usernames with valid emails through for email-based identity lookup, all but the last "unknown" maintainer are silently dropped before processing. Contrast with insert_new_maintainers, which iterates the list directly and processes every entry.


"docs/maintainers.md",
".github/maintainers.md",
".github/contributors.md",
".github/codeowners",

Case mismatch prevents SECURITY-INSIGHTS.md from matching

Medium Severity

KNOWN_PATHS contains "SECURITY-INSIGHTS.md" in uppercase, but _score_filename lowercases candidate_path before checking membership. The lowercased "security-insights.md" will never match the uppercase entry. Additionally, no GOVERNANCE_STEMS entry matches "security-insights", so _ripgrep_search won't discover the file either. This file type is effectively unsupported despite being listed.
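A small reproduction of the membership failure, with one way to avoid it: normalize the set once at definition time so that mixed-case entries cannot silently miss. The names `RAW_KNOWN_PATHS` and `is_known_path` are hypothetical, and the set is shortened for illustration.

```python
# Entries as a developer might type them, including one in uppercase.
RAW_KNOWN_PATHS = {"maintainers.md", "SECURITY-INSIGHTS.md"}

# Normalizing once makes lookups independent of how entries were written.
KNOWN_PATHS = frozenset(path.lower() for path in RAW_KNOWN_PATHS)


def is_known_path(candidate_path: str, known=KNOWN_PATHS) -> bool:
    """Compare the lowercased candidate against the given set of known paths."""
    return candidate_path.lower() in known
```

Against the raw set, `is_known_path("SECURITY-INSIGHTS.md", RAW_KNOWN_PATHS)` is false, because the lowercased candidate never equals the uppercase entry; against the normalized set it is true.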


line[2:] if line.startswith("./") else line
for line in output.strip().split("\n")
if line.strip()
]

Empty extension glob matches all files in listing

Medium Severity

VALID_EXTENSIONS includes "" (empty string), so _list_repo_files generates --iglob "*" which matches all files. Since multiple ripgrep --iglob include patterns use OR logic, this wildcard makes all other extension-specific patterns redundant. The function returns every file in the repo instead of filtering to document-like extensions. This degrades the Step 4 AI fallback: when no files score above zero, the first 300 files sent to AI will be arbitrary (likely source code) rather than text/document files.
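The effect of the empty extension can be illustrated with Python's `fnmatch`, which only approximates ripgrep's glob semantics but is enough to show that a bare `*` matches everything. The sketch assumes globs are built as `"*" + extension`; `naive_globs`, `safe_globs`, and `has_valid_extension` are hypothetical helpers, and the extension set is shortened.

```python
import os
from fnmatch import fnmatch

VALID_EXTENSIONS = {"", ".md", ".txt", ".yaml"}


def naive_globs(extensions: set[str]) -> list[str]:
    """One glob per extension; the empty extension degenerates to '*'."""
    return sorted(f"*{ext}" for ext in extensions)


def safe_globs(extensions: set[str]) -> list[str]:
    """Skip the empty extension so no catch-all pattern is generated."""
    return sorted(f"*{ext}" for ext in extensions if ext)


def matches_any(name: str, globs: list[str]) -> bool:
    return any(fnmatch(name, pattern) for pattern in globs)


def has_valid_extension(path: str, extensions: set[str]) -> bool:
    """Post-filter alternative: extensionless files such as OWNERS still pass."""
    return os.path.splitext(path)[1].lower() in extensions
```

Dropping the empty extension from the globs would also drop extensionless files such as OWNERS, so a post-filter like `has_valid_extension` over the full listing may be the simpler option.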


ai_cost = 0.0
maintainers_found = 0
maintainers_skipped = 0
candidate_files: list[str] = []

Type annotation mismatch for candidate_files variable

Low Severity

candidate_files is declared as list[str] but is later assigned maintainers.candidate_files which is list[tuple[str, int]] (path and score pairs). The annotation is misleading and inconsistent with MaintainerResult.candidate_files. This won't crash at runtime since Python doesn't enforce type hints, but it obscures the actual data shape written into the metrics dict.



4 participants